1. Introduction


One of the major goals of parallel algorithm design for PRAM models is to come up with 
parallel algorithms that are both fast and efficient, i.e. that run in polylog time while the 
product of their time and processor complexities is within a polylog factor of the time 
complexity of the best sequential algorithm for the problem they solve. This goal has 
been elusive for many simple problems that are trivially in the class NC (recall that NC 
is the class of problems that are solvable in O(log^{O(1)} n) parallel time by a PRAM using a
polynomial number of processors). For example, topological sorting of a DAG and finding 
a breadth-first search tree of a graph are problems that are trivially in NC, and yet it is not 
known whether either of them can be solved in polylog time with n^2 processors.

This paper gives parallel algorithms for the string editing problem that are both fast and 
efficient in the above sense. We give a CREW-PRAM algorithm that runs in O(log m log n)
time with O(mn/log m) processors, where m (resp. n) is the length of the shorter (resp.
longer) of the two input strings. We also give a CRCW-PRAM algorithm that runs in
O(log n (log log m)^2) time with O(mn/log log m) processors. In both algorithms, space is
O(mn).

In related work, Ranka and Sahni [17] have designed a hypercube algorithm for m = n
that runs in O(√(n log n)) time with n^2 processors, and have considered time/processor
tradeoffs. In independent work, Mathies [15] has obtained a CRCW-PRAM algorithm for
the edit distance that runs in O(log n log m) time with O(mn) processors if the weight of
every edit operation is smaller than a given constant integer.

Recall that the CREW-PRAM model of parallel computation is the synchronous shared- 
memory model where concurrent reads are allowed but no two processors can simultaneously 
attempt to write in the same memory location (even if they are trying to write the same 
thing). The CRCW-PRAM differs from the CREW-PRAM in that it allows many processors 
to write simultaneously in the same memory location: in any such common-write contest, 
only one processor succeeds, but it is not known in advance which one. 

The rest of this introduction reviews the problem, its importance, and how it can be 
viewed as a shortest-paths problem on a special type of graph. 

Let x be a string of |x| symbols on some alphabet I. We consider three edit operations
on x, namely, deletion of a symbol from x, insertion of a new symbol in x, and substitution
of one of the symbols of x with another symbol from I. We assume that each edit operation
has an associated nonnegative real number representing the cost of that operation. More
precisely, the cost of deleting from x an occurrence of symbol a is denoted by D(a), the cost
of inserting some symbol a between any two consecutive positions of x is denoted by I(a),
and the cost of substituting some occurrence of a in x with an occurrence of b is denoted by
S(a,b). An edit script on x is any consistent (i.e., all edit operations are viable) sequence σ
of edit operations on x, and the cost of σ is the sum of all costs of the edit operations in σ.

Now, let x and y be two strings of respective lengths |x| and |y|. The string editing 
problem for input strings x and y consists of finding an edit script σ' of minimum cost
that transforms x into y. The cost of σ' is the edit distance from x to y. In various ways
and forms, the string editing problem arises in many applications, notably, in text editing, 
speech recognition, machine vision and, last but not least, molecular sequence comparison. 
For this reason, this problem has been studied rather extensively in the past, and forms 
the object of several papers (e.g. [13,14,16,18,20,19,25], to list a few). The problem is 
solved by a serial algorithm in 0(|x||y|) time and space, through dynamic programming 
(cf. for example, [25]). Such a performance represents a lower bound when the queries on 
symbols of the string are restricted to tests of equality [2,26]. Many important problems are 
special cases of string editing, including the longest common subsequence problem and the 
problem of approximate matching between a pattern string and text string (see [11,21,23]
for the notion of approximate pattern matching and its connection to the string editing
problem). Needless to say, our solution to the general string editing problem implies
similar bounds for all these special cases.

The criterion that subtends the computation of edit distances by dynamic programming 
is readily stated. For this, let C(i,j), (0 ≤ i ≤ |x|, 0 ≤ j ≤ |y|) be the minimum cost of
transforming the prefix of x of length i into the prefix of y of length j. Let s_k denote the
k-th symbol of string s. Then:

C(i,j) = min{C(i-1,j-1) + S(x_i,y_j), C(i-1,j) + D(x_i), C(i,j-1) + I(y_j)},

for all i,j, (1 ≤ i ≤ |x|; 1 ≤ j ≤ |y|). Hence C(i,j) can be evaluated row-by-row or
column-by-column in O(|x||y|) time [25]. Observe that, of all entries of the C-matrix, only the three
entries C(i-1,j-1), C(i-1,j) and C(i,j-1) are involved in the computation of the
final value of C(i,j). Such interdependencies among the entries of the C-matrix induce an
(|x| + 1) x (|y| + 1) grid directed acyclic graph (grid DAG for short) associated with the
string editing problem.
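
The recurrence above can be illustrated by a minimal sequential sketch. The function names, the use of Python callables for the cost functions D, I and S, and the initialization of row 0 and column 0 (pure insertions and pure deletions) are assumptions made for illustration; the text does not spell out the boundary values.

```python
from typing import Callable


def edit_distance(x: str, y: str,
                  D: Callable[[str], float],
                  I: Callable[[str], float],
                  S: Callable[[str, str], float]) -> float:
    """Fill the C-matrix row by row; C[i][j] is the cost of turning x[:i] into y[:j]."""
    C = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    # Assumed boundary values: delete every symbol of the prefix of x ...
    for i in range(1, len(x) + 1):
        C[i][0] = C[i - 1][0] + D(x[i - 1])
    # ... or insert every symbol of the prefix of y.
    for j in range(1, len(y) + 1):
        C[0][j] = C[0][j - 1] + I(y[j - 1])
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            C[i][j] = min(C[i - 1][j - 1] + S(x[i - 1], y[j - 1]),  # substitution
                          C[i - 1][j] + D(x[i - 1]),                # deletion
                          C[i][j - 1] + I(y[j - 1]))                # insertion
    return C[len(x)][len(y)]


def unit_edit_distance(x: str, y: str) -> float:
    """Illustrative instance: every operation costs 1, matching symbols cost 0."""
    return edit_distance(x, y,
                         lambda a: 1.0,
                         lambda a: 1.0,
                         lambda a, b: 0.0 if a == b else 1.0)
```

Each entry reads only its three neighbours, which is exactly the dependency pattern that the grid DAG captures.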

Figure 1. Example of a 5 x 10 grid DAG.


Definition 1 An m x n grid DAG is a directed acyclic graph whose vertices are the mn
points of an m x n grid, and such that the only edges from grid point (i,j) are to grid points
(i,j+1), (i+1,j) and (i+1,j+1).

Figure 1 shows an example of a grid DAG and also illustrates our convention of drawing
the points such that point (i,j) is at the i-th row from the top and j-th column from the
left. Note that the top-left point is (1,1) and has no edge coming into it (i.e. is a source),
and that the bottom-right point is (m,n) and has no edge leaving it (i.e. is a sink).

We associate an (|x| + 1) x (|y| + 1) grid DAG G with the string editing problem in the
obvious way: the (|x| + 1)(|y| + 1) vertices of G are in one-to-one correspondence with the
(|x| + 1)(|y| + 1) entries of the C-matrix, and the cost of an edge from vertex (k,l) to vertex
(i,j) is equal to I(y_j) if k = i and l = j - 1, to D(x_i) if k = i - 1 and l = j, to S(x_i,y_j) if
k = i - 1 and l = j - 1. We can restrict our attention to edit scripts which are not wasteful
in the sense that they make no obviously inefficient moves like: inserting then deleting the
same symbol, or changing a symbol into a new symbol which they then delete, etc. More
formally, the only edit scripts considered are those that apply at most one edit operation
to a given symbol occurrence. Such edit scripts that transform x into y or vice versa are in
one-to-one correspondence with the weighted paths in G that originate at the source (which
corresponds to C(0,0)) and end on the sink (which corresponds to C(|x|,|y|)). Thus, in
order to establish the complexity bounds claimed in this paper, we need only establish them 
for the problem of finding a shortest (i.e. least-cost) source-to-sink path in an m x n grid 
DAG G. Throughout, the left boundary of G is the set of points in its leftmost column. 
The right, top, and bottom boundaries are analogously defined. The boundary of G is the 
union of its left, right, top and bottom boundaries. 

The rest of the paper is organized as follows. Section 2 gives a preliminary CREW-PRAM
algorithm for computing the length of a shortest source-to-sink path, assuming
m = n. Section 3 gives an algorithm that uses a factor of log m fewer processors than
the previous one and that will later be needed in our best CREW algorithm (given in
Section 6). Section 4 sketches how to extend the previous algorithm to the case m < n.
Section 5 considers computing the path itself rather than just its length. Section 6 gives
our best CREW-PRAM algorithm. Section 7 gives the CRCW-PRAM algorithm. Section
8 concludes.

Figure 2. Illustrating how the problem is partitioned.

2. A preliminary algorithm 

Throughout this section, m = n, i.e. G is an m x m grid DAG. Let DIST_G be a (2m) x (2m)
matrix containing the lengths of all shortest paths that begin at the top or left boundary
of G, and end at the right or bottom boundary of G. In this section we establish that the
matrix DIST_G can be computed in O(log^3 m) time, O(m^2) space, and with O(m^2/log m)
processors by a CREW-PRAM. The preliminary algorithm that achieves this is intended
as a “warmup” for the better algorithms that follow in later sections. The preliminary
algorithm works as follows: divide the m x m grid into four (m/2) x (m/2) grids A, B, C, D,
as shown in Figure 2. In parallel, recursively solve the problem for each of the four grids
A, B, C, D, obtaining the four distance matrices DIST_A, DIST_B, DIST_C, DIST_D. Then
obtain from these four matrices the desired matrix DIST_G. The main problem is how to
perform this last step efficiently.

The performance bounds we claimed for this preliminary algorithm would immediately
follow if we can show that, for any integer q < m of our choice, DIST_G can be obtained
from DIST_A, DIST_B, DIST_C, DIST_D in time O((q + log m) log m) and with O(m^2/q)
processors. This is because the time and processor complexities of the overall algorithm
would then obey the following recurrences:

T(m) ≤ T(m/2) + c_1 (q + log m) log m

P(m) ≤ max(4P(m/2), c_2 m^2/q),

with boundary conditions T(√q) = c_3 q and P(√q) = 1, where c_1, c_2, c_3 are constants. The
solutions are T(m) = O((q + log m) log^2 m) and P(m) = O(m^2/q). Choosing q = log m
would then establish the desired result. Therefore in the rest of this section, we merely
concern ourselves with showing that DIST_G can be obtained from DIST_A, DIST_B, DIST_C,
DIST_D in time O((q + log m) log m) and with O(m^2/q) processors.
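
As a check on the stated solutions, the two recurrences can be unrolled as follows. This is only a sketch: it assumes that m/√q is a power of two, writes K for the resulting recursion depth (a symbol introduced here), and bounds the cost of every level by the cost of the top level.

```latex
\begin{align*}
K &= \log_2\!\bigl(m/\sqrt{q}\bigr) \le \log m, \\
T(m) &\le T(\sqrt{q}) + \sum_{k=0}^{K-1} c_1\bigl(q + \log(m/2^k)\bigr)\log(m/2^k)
      \le c_3\, q + c_1 (q + \log m)\log^2 m, \\
P(m) &\le \max\Bigl( 4^{K} P(\sqrt{q}),\ \max_{0 \le k < K} 4^{k}\, c_2\, \frac{(m/2^{k})^2}{q} \Bigr)
      = \max\Bigl( \frac{m^2}{q},\ c_2\, \frac{m^2}{q} \Bigr) = O(m^2/q).
\end{align*}
```

With q = log m these give T(m) = O(log^3 m) and P(m) = O(m^2/log m), which are the bounds claimed at the start of this section.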

Let DIST_{A∪B} be the (3m/2) x (3m/2) matrix containing the lengths of shortest paths
that begin on the top or left boundary of A ∪ B and end on its right or bottom boundary.
Let DIST_{C∪D} be analogously defined for C ∪ D. The procedure for obtaining DIST_G
performs the following Steps 1-3:

1. Use DIST_A and DIST_B to obtain DIST_{A∪B}.

2. Use DIST_C and DIST_D to obtain DIST_{C∪D}.

3. Use DIST_{A∪B} and DIST_{C∪D} to obtain DIST_G.

We only show how Step 1 is done, since the procedures for Steps 2 and 3 are very similar.
First, note that the entries of DIST_{A∪B} that correspond to shortest paths that begin and
end on the boundary of A (resp. B) are already available in DIST_A (resp. DIST_B), and
can therefore be obtained in O(q) time. Therefore we need only worry about the entries of
DIST_{A∪B} that correspond to paths that begin on the top or left boundary of A and end
on the right or bottom boundary of B. Assign to every point v on the top or left boundary
of A a group of m/q processors. The task of the group of m/q processors assigned to v is
to compute the lengths of all shortest paths that begin at v and end on the right or bottom
boundary of B. It suffices to show that it can indeed do this in time O((q + log m) log m).
Observe that:

DIST_{A∪B}(v,w) = min{DIST_A(v,p) + DIST_B(p,w) |
p is on the boundary common to A and B} (1)

Using (1) to compute DIST_{A∪B}(v,w) for a given v,w pair is trivial to do in time O(q +
log(m/q)) by using O(m/q) processors for each such pair, but that would require an
unacceptable O(m^3/q) processors. We have only m/q processors assigned to v for computing
DIST_{A∪B}(v,w) for all w on the bottom or right boundary of B. Surprisingly, these m/q
processors are enough for doing the job in time O((q + log(m/q)) log m). The procedure is
given below.

Figure 3. Illustrating the procedure for computing the function θ.


Definition 2 Let v be any point on the left or top boundary of A, and let w be any point on
the bottom or right boundary of B. Let θ(v,w) denote the leftmost p which minimizes the
right-hand side of (1). Equivalently, θ(v,w) is the leftmost point of the common boundary
of A and B such that a shortest v-to-w path goes through it.


Define a linear ordering <_B on the m points at the bottom and right boundaries of
B, such that they are encountered in increasing order of <_B by a walk that starts at the
leftmost point of the lower boundary of B and ends at the top of the right boundary of B.
Let L_B be the list of m points on the lower and right boundaries of B, sorted by increasing
order according to the <_B relationship. For any w_1, w_2 ∈ L_B, we have the following:

If w_1 <_B w_2 then θ(v,w_1) is not to the right of θ(v,w_2). (2)

Before proving property (2), we sketch how it is used to obtain an O((q + log(m/q)) log m)
time and O(m/q) processor algorithm for computing DIST_{A∪B}(v,w) for all w ∈ L_B. We
henceforth use θ(w) as a shorthand for θ(v,w), with v being understood. It suffices to
compute θ(w) for all w ∈ L_B. The procedure for doing this is recursive, and takes as input:

• A particular range of r contiguous values in L_B, say a range that begins at point a
and ends at point c, a <_B c,

• The points θ(a) and θ(c),

• A number of processors equal to max{1, (p + r)/q} where p is the number of points
between θ(a) and θ(c) on the boundary common to A and B. (See Figure 3.)

Figure 4. Illustrating the proof of property (2).


The procedure returns θ(w) for every a <_B w <_B c. If r = 1 then there is only one such
w and there are enough processors to compute θ(w) in time O(q + log(p/q)). If r > 1 then
all of the max{1, (p + r)/q} processors get assigned to the median of the a-to-c range and
compute, for that median (call it point b), the value θ(b) in time O(q + log(p/q)). Because
of (2), it is now enough for the procedure to recursively call itself on the a-to-b range and
(in parallel) the b-to-c range. The first (resp. second) of these recursive calls gets assigned
max{1, (p_1 + r/2)/q} (resp. max{1, (p_2 + r/2)/q}) processors, where p_1 (resp. p_2) is the
number of points between θ(a) and θ(b) (resp. between θ(b) and θ(c)). Because p_1 + p_2 = p,
there are enough processors available for the two recursive calls. (See Figure 3.) In the
initial call to the procedure, it is given (i) the whole list L_B, (ii) the θ of the first and last
point of L_B, and (iii) 3m/(2q) processors. The depth of the recursion is log m, at each level
of which the time taken is no more than O(q + log(m/q)). Therefore the procedure takes
time O((q + log(m/q)) log m) with O(m/q) processors. We conclude that the preliminary
solution would immediately follow if we establish (2).
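
The recursive procedure can be illustrated by the following sequential sketch for one fixed source v. It simulates only the search-range narrowing that property (2) permits and does not model the processor allocation. The function name, the list-based inputs and the 0-based indexing are assumptions: `dist_a_v[p]` stands for DIST_A(v,p), and `dist_b[p][w]` for DIST_B(p,w) with w indexed in <_B order.

```python
def theta_for_source(dist_a_v, dist_b):
    """Return (theta, dist) for a fixed v, assuming property (2) holds.

    theta[w] is the leftmost p minimizing dist_a_v[p] + dist_b[p][w];
    dist[w] is the corresponding minimum, i.e. DIST_{A∪B}(v, w).
    """
    h = len(dist_a_v)        # number of points on the common boundary
    r = len(dist_b[0])       # number of points in L_B
    theta = [0] * r

    def leftmost_argmin(w, lo, hi):
        # Scan only the candidates p in [lo, hi]; ties go to the leftmost p.
        best = lo
        for p in range(lo + 1, hi + 1):
            if dist_a_v[p] + dist_b[p][w] < dist_a_v[best] + dist_b[best][w]:
                best = p
        return best

    def solve(a, c):
        # theta[a] and theta[c] are known; fill in every w strictly between them.
        if c - a < 2:
            return
        b = (a + c) // 2     # median of the a-to-c range
        theta[b] = leftmost_argmin(b, theta[a], theta[c])
        solve(a, b)          # these two calls are independent and would
        solve(b, c)          # run in parallel on a PRAM

    theta[0] = leftmost_argmin(0, 0, h - 1)
    theta[r - 1] = leftmost_argmin(r - 1, theta[0], h - 1)
    solve(0, r - 1)
    dist = [dist_a_v[theta[w]] + dist_b[theta[w]][w] for w in range(r)]
    return theta, dist
```

At each recursion level the scanned ranges overlap only at their endpoints, which is the reason the p_1 + p_2 = p accounting above works.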


We prove (2) by contradiction: Suppose that, for some w_1, w_2 ∈ L_B, we have w_1 <_B w_2
and θ(w_1) is to the right of θ(w_2), as shown in Figure 4. By definition of the function θ
there is a shortest path from v to w_1 going through θ(w_1) (call this path α), and one from v
to w_2 going through θ(w_2) (call it β). Since w_1 <_B w_2 and θ(w_1) is to the right of θ(w_2), the
two paths α and β must cross at least once somewhere in B: let z be such an intersection
point. See Figure 4. Let prefix(α) (resp. prefix(β)) be the portion of α (resp. β) that
goes from v to z. We obtain a contradiction in each of two possible cases:


• Case 1. The length of prefix(α) differs from that of prefix(β). Without loss of
generality, assume it is the length of prefix(β) that is the smaller of the two. But
then, the v-to-w_1 path obtained from α by replacing prefix(α) by prefix(β) is shorter
than α, a contradiction.

• Case 2. The length of prefix(α) is the same as that of prefix(β). In α, replacing
prefix(α) by prefix(β) yields another shortest path between v and w_1, one that
crosses the boundary common to A and B at a point to the left of θ(w_1), contradicting
the definition of the function θ.

This completes the proof of (2). 

A referee pointed out that ideas similar to those in this Section were independently found 
by Baruch Schieber and Uzi Vishkin. 

3. Using fewer processors 

This section gives an algorithm that has the same time complexity as that of the previous
section, but whose processor complexity is a factor of log m better. This is more than a
mere “warmup” for our best CREW algorithm of Section 6: the algorithm of Section 6 will
actually use the technical result, given in this section, that DIST_{A∪B} can be obtained from
DIST_A and DIST_B with O(m^2) total work.

We establish the following lemma. 

Lemma 1 Let G be an m x m grid DAG. Let DIST_G be a (2m) x (2m) matrix containing
the lengths of all shortest paths that begin at the top or left boundary of G, and end at the
right or bottom boundary of G. The matrix DIST_G can be computed in O(log^3 m) time,
O(m^2) space, and with O(m^2/log^2 m) processors by a CREW-PRAM.

We prove the above lemma by giving an algorithm whose processor complexity is a log m
factor better than that of the preliminary solution of Section 2. We illustrate the method
by showing how DIST_{A∪B} can be obtained from DIST_A and DIST_B in O(log^2 m) time
and O(m^2/log^2 m) processors. The preliminary procedure for computing DIST_{A∪B} can
be seen to do a total amount of work which is O(m^2 log m). Our strategy will be to first
give a procedure which has the same time and processor complexities as the preliminary one,
but which does a total amount of work which is only O(m^2). Our claimed bounds for the
computation of DIST_{A∪B} from DIST_A and DIST_B will then follow from this improved
procedure and from Brent’s theorem [5]:

Theorem 1 (Brent) Any synchronous parallel algorithm taking time T that consists of a
total of W operations can be simulated by P processors in time O((W/P) + T).

Proof. See [5]. • 

There are actually two qualifications to Brent’s theorem before one can apply it to a
PRAM: (i) at the beginning of the i-th parallel step, we must be able to compute the
amount of work W_i done by that step, in time O(W_i/P) and with P processors, and (ii) we
must know how to assign each processor to its task. Both (i) and (ii) will trivially hold in
our framework.

Let L_A and <_A be defined analogously to L_B and <_B, respectively. In other words, L_A
is a list of the m points on the left and top boundaries of A, sorted in the order in which
they are encountered by a walk that starts at the lowest point of the left boundary of A
and ends at the rightmost point of the top boundary of A (i.e. sorted by increasing order
according to the <_A relationship). A symmetric version of (2) holds, i.e., for any w ∈ L_B
and any two points v_1 and v_2 of L_A, we have the following:

If v_1 <_A v_2 then θ(v_1,w) is not to the right of θ(v_2,w). (3)

The proof of (3) is identical to that of (2) and is therefore omitted. 

Let P be the m x (m/2) submatrix of DIST_A containing the lengths of the shortest
paths that begin at the top or left boundary of A, and end at its bottom boundary. Let Q
be the (m/2) x m submatrix of DIST_B containing the lengths of the shortest paths that
begin at the top boundary of B, and end at its bottom or right boundary. By definition,
the rows of P are indexed by the entries of L_A, the columns of Q are indexed by the entries
of L_B, and the columns of P (hence the rows of Q) are indexed by the m/2 points at the
common boundary of A and B, sorted from left to right. The problem we face is that of
“multiplying” the m x (m/2) matrix P and the (m/2) x m matrix Q in the closed semiring
(min, +). In matrix terminology, θ(v,w) is the smallest index k, 1 ≤ k ≤ m/2, such that
PQ(v,w) = P(v,k) + Q(k,w). We give the procedure below for the (more general) case
where P is an ℓ x h matrix, and Q is an h x ℓ matrix, ℓ ≤ 2h. The only structure of
these matrices that our algorithm uses is the following property (4), which is merely a
restatement of properties (2) and (3) using matrix terminology:

∀(1 ≤ v_1 < v_2 ≤ ℓ, 1 ≤ w ≤ ℓ), θ(v_1,w) ≤ θ(v_2,w), and θ(w,v_1) ≤ θ(w,v_2). (4)

To compute the product of P and Q in the closed semiring (min, +), it suffices to compute
θ(v,w) for all 1 ≤ v,w ≤ ℓ. To compute the product PQ (i.e. the function θ), we use the
following procedure which runs in O(log ℓ log h) time with O(ℓh/log h) processors and O(ℓh)
total work:

1. Recursively solve the problem for the product P'Q' where P' (resp. Q') is the (ℓ/2) x h
(resp. h x (ℓ/2)) matrix consisting of the odd rows (resp. odd columns) of P (resp.
Q). This gives θ(v,w) for all pairs (v,w) whose respective parities are (odd, odd). If
Work(ℓ,h) and T(ℓ,h) denote the total work and time for this procedure, then this
step does Work(ℓ/2,h) work in T(ℓ/2,h) time.

2. Compute θ(v,w) for all pairs (v,w) of parities (even, odd). This is done as follows. In
parallel for each odd w, assign h/log h processors to w, with the task of computing
θ(v,w) for all even v. The fact that we already know θ(v,w) for all odd v, together
with property (4), implies that these h/log h processors are enough to do the job in
O(log h) time. The work done is then O(h) for each such w, for a total of O(ℓh) work
for this step.

3. Compute θ(v,w) for all pairs (v,w) of parities (odd, even). The method used is
identical to that of the previous step and is therefore omitted.

4. Compute θ(v,w) for all pairs (v,w) of parities (even, even). The method is very similar
to that of the previous two steps and is therefore omitted.

The time, processor, and work complexities of the above method satisfy the recurrences:

T(ℓ,h) ≤ T(ℓ/2,h) + c_1 log h,

P(ℓ,h) ≤ max{P(ℓ/2,h), ℓh/log h},

Work(ℓ,h) ≤ Work(ℓ/2,h) + c_2 ℓh,

where c_1 and c_2 are constants. These recurrences imply that T(ℓ,h) = O(log ℓ log h),
P(ℓ,h) = O(ℓh/log h), and Work(ℓ,h) = O(ℓh). This, together with Theorem 1 (Brent’s
theorem) in which T = log ℓ log h, P = ℓh/q, and W = ℓh, implies that the above algorithm
can be simulated by ℓh/q processors in O(q + log ℓ log h) time. In our case, we have ℓ = m
and h = m/2, implying that PQ (and hence DIST_{A∪B}) can be obtained from P and Q in
O(q + log^2 m) time with O(m^2/q) processors.
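
Steps 1-4 can be illustrated by the following sequential sketch. It mirrors only the order in which the θ values are filled in and the way property (4) narrows each search range; it is not a PRAM implementation and does not model the h/log h processor groups. The function name, the 0-based indexing (so the paper's "odd" rows and columns are the even Python positions 0, 2, 4, ...) and the use of Python lists are assumptions made for illustration.

```python
def min_plus_with_theta(P, Q):
    """(min,+) product of P (l x h) and Q (h x t), assuming property (4).

    Returns (product, theta) where theta[v][w] is the smallest k that
    minimizes P[v][k] + Q[k][w].
    """
    l, h, t = len(P), len(Q), len(Q[0])
    theta = [[0] * t for _ in range(l)]

    def argmin(v, w, lo, hi):
        # Leftmost minimizer of P[v][k] + Q[k][w] over k in [lo, hi].
        best = lo
        for k in range(lo + 1, hi + 1):
            if P[v][k] + Q[k][w] < P[v][best] + Q[best][w]:
                best = k
        return best

    def fill_rows(rows, cols_known):
        # Rows at positions 1, 3, ... are bracketed by their neighbouring rows.
        for i in range(1, len(rows), 2):
            for w in cols_known:
                lo = theta[rows[i - 1]][w]
                hi = theta[rows[i + 1]][w] if i + 1 < len(rows) else h - 1
                theta[rows[i]][w] = argmin(rows[i], w, lo, hi)

    def fill_cols(rows_known, cols):
        # Columns at positions 1, 3, ... are bracketed by their neighbouring columns.
        for j in range(1, len(cols), 2):
            for v in rows_known:
                lo = theta[v][cols[j - 1]]
                hi = theta[v][cols[j + 1]] if j + 1 < len(cols) else h - 1
                theta[v][cols[j]] = argmin(v, cols[j], lo, hi)

    def solve(rows, cols):
        if len(rows) == 1 and len(cols) == 1:
            theta[rows[0]][cols[0]] = argmin(rows[0], cols[0], 0, h - 1)
            return
        solve(rows[0::2], cols[0::2])   # Step 1: (odd, odd) pairs, recursively
        fill_rows(rows, cols[0::2])     # Step 2: (even, odd) pairs
        fill_cols(rows[0::2], cols)     # Step 3: (odd, even) pairs
        fill_rows(rows, cols[1::2])     # Step 4: (even, even) pairs

    solve(list(range(l)), list(range(t)))
    product = [[P[v][theta[v][w]] + Q[theta[v][w]][w] for w in range(t)]
               for v in range(l)]
    return product, theta
```

For a fixed column, the ranges scanned for consecutive unknown rows overlap only at their endpoints, so the scan length per column is about h plus the number of rows; this is the source of the O(h) work per w in Step 2.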

Figure 5. Illustrating Lemma 2.

The above method enables us to obtain DIST_G from DIST_A, DIST_B, DIST_C, DIST_D
in O(q + log^2 m) time and O(m^2/q) processors. This implies that the overall
divide-and-conquer algorithm runs in O((q + log^2 m) log m) time with O(m^2/q) processors. Choosing
q = log^2 m establishes Lemma 1.

4. The case m < n 

This section generalizes the algorithm for the case m < n. The main result is the following. 

Theorem 2 Let G be an m x n grid DAG, m < n. The length of a shortest source-to-sink
path in G can be computed by a CREW-PRAM in O(log n log^2 m) time, O(mn) space, and
with O(mn/log^2 m) processors.

Note that, if G is m x n with m < n, then using the same idea as in Section 3 would
result in an unacceptable (m + n)(m + n)/log^2(m + n) processor complexity, the DIST_G
matrix we are computing now being (m + n) x (m + n). In order to prove our claimed bounds,
we shall abandon the goal of computing such a matrix DIST_G and settle for computing a
D_G matrix that contains less information than DIST_G, but enough to obtain the desired
quantity: the length of a shortest source-to-sink path in G.

Definition 3 For any m x n grid DAG G, m < n, let D_G be the m x m matrix containing
the lengths of all the shortest paths that begin at the left boundary of G, and end at the right
boundary of G.

Note that D_G is a submatrix of DIST_G.

The following lemma is another ingredient that we need. 

Lemma 2 Let G be an m x m' grid DAG that is partitioned by a vertical line into G_1
and G_2. (See Figure 5.) Then, given D_{G_1} and D_{G_2}, the matrix D_G can be computed by a
CREW-PRAM in O(log^2 m) time, O(m^2) space, and with O(m^2/log^2 m) processors.

Figure 6. Illustrating the partitioning of G.


Proof. The algorithm proving the above lemma is similar to the procedure we used in
Section 3 to obtain DIST_{A∪B} from DIST_A and DIST_B, and is omitted. •

We are now ready to prove Theorem 2. 

Proof of Theorem 2. Without loss of generality, assume that m divides n (if not then
G can always be “padded” with extra vertices and zero-cost edges so as to make it m x n'
where m divides n' and n' - n < m). Partition G by vertical lines into n/m grid DAGs
G_1, ..., G_{n/m}, where each G_i is m x m (see Figure 6). In parallel for each i ∈ {1, ..., n/m},
use Lemma 1 to obtain the DIST_{G_i} matrices. This takes O(log^3 m) time with a total of
O((m^2/log^2 m)(n/m)) = O(mn/log^2 m) processors. From each DIST_{G_i} matrix, extract
its submatrix D_{G_i}. We are now left with the task of combining the D_{G_i}'s into a single D_G.
In parallel, we recursively obtain the D-matrix of the union of the leftmost n/(2m) G_i's, and
similarly the D-matrix of the union of the rightmost n/(2m) G_i's. We then combine these two
D matrices into D_G by using Lemma 2. This recursive combining procedure takes a total
of O(log^2 m log(n/m)) time with O(mn/log^2 m) processors. The overall time complexity is
therefore O(log^3 m + log^2 m log(n/m)) = O(log n log^2 m). •

In view of the remarks made in Section 1, the following is an immediate consequence of 
the above theorem. 

Corollary 1 Let x and y be two strings over an alphabet I. Let m = min(|x|, |y|), n =
max(|x|, |y|). For edit operations of arbitrary nonnegative costs, the edit distance from x
to y can be computed by a CREW-PRAM in O(log n log^2 m) time, O(mn) space, and with
O(mn/log^2 m) processors.

5. Computing the actual path 

In this section we sketch a modification of the algorithm given in the previous sections 
which enables us to compute an actual shortest source-to-sink path in G within the same 
time, space, and processor bounds as in the length computation. 

Figure 7. Illustrating the computation of the actual path.

Theorem 3 Let G be an m x n grid DAG, m < n. A shortest source-to-sink path in
G can be computed by a CREW-PRAM in O(log n log^2 m) time, O(mn) space, and with
O(mn/log^2 m) processors.

The rest of this section proves the above theorem. 

We begin with the case m = n, i.e. an m x m grid DAG. We cannot afford to let the
matrix DIST_G of Section 3 be a matrix of paths instead of lengths, because that would take
m^3 space, killing any hope of a polylog time algorithm that does not use an almost cubic
number of processors. Instead, we modify the algorithm of Section 3 so that it also has the
“side effect” of computing two (2m) x (2m) matrices HCUT_G and VCUT_G (mnemonics for
“horizontal cut” and “vertical cut”, respectively) having the same index domain as DIST_G.
These two matrices are global in the sense that they remain even after the recursive call
returns, and their significance is as follows. Let H be the horizontal boundary between
A ∪ C and B ∪ D, and let V be the vertical boundary between A ∪ B and C ∪ D (see Figure
7). Let PATH(x,y) be the lowest x-to-y path of cost DIST_G(x,y); i.e. no other x-to-y path
of length DIST_G(x,y) goes through any vertex that is below a vertex of PATH(x,y). It
is easy to prove that there is a unique such path PATH(x,y) (the proof is straightforward
and is omitted). Then HCUT_G(x,y) is the leftmost intersection of PATH(x,y) with H,
and VCUT_G(x,y) is the lowest intersection of PATH(x,y) with V. If the intersection
of PATH(x,y) with H (resp. V) is empty, then HCUT_G(x,y) (resp. VCUT_G(x,y)) is
undefined. Because these additional matrices are global, after the algorithm terminates it
leaves behind N(m) of them where

N(m) = 4N(m/2) + 2 = O(m^2).

Fortunately, even though there are O(m^2) such HCUT and VCUT matrices that remain,
the total storage space they take is S(m) where

S(m) = 4S(m/2) + cm^2 = O(m^2 log m).

Before showing how S(m) is decreased to O(m^2), we show how the matrices HCUT and
VCUT are used to retrieve the shortest source-to-sink path in G. It suffices to output the
points on this path as a set (i.e. in arbitrary order), since a postprocessing sorting step
puts them in the right order in O(log m) time and O(m) processors [6]. Let s and t denote
the source and sink of G, respectively. We first print HCUT_G(s,t) and VCUT_G(s,t), and
then we recursively print the three portions of the shortest s-to-t path determined by its
two intersections with H and V (this involves three (m/2) x (m/2) grid DAGs; see Figure
7). The procedure can be implemented to run in O(h + log m) time with 2m/h processors,
where h < m is an integer of our choice, by maintaining the property that each recursive
call of size m' > h gets assigned 2m'/h processors (the bottom of the recursion is when
problem size m' becomes < h, at which time a single processor finishes the job sequentially,
in O(m') time). (We would, of course, choose h = log m.)

We bring the space complexity S(m) down by storing each row (say, row p) of the HCUT
(or VCUT) matrix in an O(m)-bit vector ROW(p) that is “packed” in O(m/log m) registers
of size log m bits each. (The assumption that word size is a logarithmic function of problem
size is a standard one [3].) Let us immediately point out that a consequence of this encoding
scheme is that we now have S(m) = O(m^2). To see this, let BITS(m) be the total number
of bits used by the encoding scheme, and note that S(m) = O(BITS(m)/log m), since each
register contains log m bits. Thus it suffices to show that BITS(m) = O(m^2 log m). But
this trivially follows from the fact that BITS(m) = 4BITS(m/2) + O(m^2).

We now describe the encoding scheme used for storing row p of (e.g.) HCUT in the
O(m)-bit vector ROW(p). We exploit the fact that the contents of row p happen to be sorted
by the left-to-right linear ordering of the points on H. More precisely, if the points of H
are denoted by 1, ..., m in left-to-right order, then row p contains a nondecreasing sequence
of O(m) integers between 1 and m. Instead of storing the entries of row p, we therefore
store the sequence of differences between the consecutive entries of row p. This sequence of
differences is stored in unary in the O(m)-bit vector ROW(p), with as many consecutive 1's
as needed to encode a particular difference, and with a 0 marking the end of each difference
(so a zero difference contributes a lone 0). For example, if row p contains the sequence
(3,3,5,7,9,11) then the sequence of differences is (3,0,2,2,2,2) and
ROW(p) = (11100110110110110). We can
actually obtain ROW(p) without going through the intermediate step of computing the
sequence of differences: simply observe that if the i-th entry of row p is k then the (i + k)-th
entry of ROW(p) is a 0 (in our example, the fourth entry is 7 and hence the eleventh entry
of ROW(p) is a 0). This observation implies that we can obtain ROW(p) in O(q + log m)
time with O(m/q) processors by first initializing all the entries of ROW(p) to 1, and then
changing some of these into 0's according to the observation. Reading the k-th entry of
row p is now done by computing the sum of all the entries of ROW(p) that precede its
k-th leftmost zero; i.e. it requires a parallel prefix computation [10] on ROW(p) and hence
O(log m) time, so that extracting the s-to-t path now takes O(log^2 m) time rather than the
previous O(log m). This fact is of no consequence, however, since the bottleneck in the time
complexity comes from the computation of the DIST_G matrix.
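
The encoding just described can be illustrated with a short sequential sketch. The function names (encode_row, zero_positions, read_entry) are hypothetical, and the plain loop in read_entry merely stands in for the parallel prefix computation; nothing here reproduces the PRAM implementation.

```python
def encode_row(row):
    """Unary-encode a nondecreasing row of positive integers.

    Each difference between consecutive entries is written as that many 1s
    followed by a single 0 (a zero difference is just a lone 0).
    """
    bits = []
    prev = 0
    for value in row:
        bits.extend([1] * (value - prev))
        bits.append(0)
        prev = value
    return bits


def zero_positions(row):
    """1-based positions of the 0s: the i-th entry k puts a 0 at position i + k."""
    return [i + k for i, k in enumerate(row, start=1)]


def read_entry(bits, k):
    """Recover the k-th entry: the number of 1s preceding the k-th leftmost 0.

    Sequential stand-in for the parallel prefix sum used in the text.
    """
    zeros = 0
    ones = 0
    for b in bits:
        if b == 0:
            zeros += 1
            if zeros == k:
                return ones
        else:
            ones += 1
    raise IndexError("row has fewer than k entries")


row = [3, 3, 5, 7, 9, 11]
bits = encode_row(row)
```

On the example of the text, the fourth entry (7) corresponds to the 0 at position 11, and decoding every entry returns the original row.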

This completes the proof of Theorem 3 for the case m = n. 

It is not hard to see that, so long as m = n, the above procedure actually works when
s and t are arbitrary points on the boundary of G. This observation implies that, for
the case m < n, it suffices to find for every i ∈ {1, ..., (n/m) - 1} the lowest point (call it
CROSS(i)) at which a shortest path from s to t crosses the boundary between G_i and G_{i+1}.
Once we have these CROSS(i)'s, we can use the procedure of the previous paragraph to
obtain the actual path joining each CROSS(i) to CROSS(i + 1) in time O(log^3 m), space
O(m^2 (n/m)) = O(mn), and with O((m^2/log^2 m)(n/m)) = O(mn/log^2 m) processors. We
obtain the CROSS(i)'s as follows. Refer to Section 4, the proof of Theorem 2: we modify
that procedure so that, as it computes the D-matrix, it also produces as a
side effect a global m x m matrix CUT_G. The significance of this matrix is that CUT_G(x, y)
is the lowest point of intersection of any shortest x-to-y path with the boundary separating
the two recursive calls. The total number of such CUT matrices is O(n/m), and their total
storage is O(mn). We use these CUT matrices to output the CROSS(i)'s as a set (i.e.
unordered) by first printing CUT_G(s, t), and then recursively printing the CROSS(i)'s that
are to the left of CUT_G(s, t), and simultaneously (i.e. in parallel) those to its right. It is
easily seen that the CROSS(i)'s are produced in time O(log(n/m)), and that there are
enough processors to carry out the procedure. A post-processing sorting step orders the
CROSS(i)'s. This completes the proof of Theorem 3. □
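
The recursive extraction of the CROSS(i)'s can be sketched sequentially as below. The callable cut(lo, hi, x, y) is a hypothetical stand-in for a lookup in the CUT matrix of the recursive call that covers strips G_lo, ..., G_hi; the toy midpoint rule used in the example is an assumption made only so the sketch can run. The two recursive calls are independent, which is what allows the text to run them in parallel.

```python
def collect_crossings(cut, lo, hi, x, y, out):
    """Record the crossing point on every internal boundary of strips lo..hi.

    x and y are the points where the path enters and leaves the strip range.
    out maps a boundary index i (between strips i and i+1) to CROSS(i).
    """
    if lo == hi:
        return  # a single strip has no internal boundary
    mid = (lo + hi) // 2          # boundary separating the two recursive calls
    p = cut(lo, hi, x, y)         # lowest crossing point on that boundary
    out[mid] = p
    collect_crossings(cut, lo, mid, x, p, out)      # crossings to the left
    collect_crossings(cut, mid + 1, hi, p, y, out)  # crossings to the right


def toy_cut(lo, hi, x, y):
    """Toy rule (assumption): the path crosses halfway between x and y."""
    return (x + y) // 2


crossings = {}
collect_crossings(toy_cut, 1, 4, 0, 8, crossings)
ordered = [crossings[i] for i in sorted(crossings)]  # the post-processing sort
```

With four strips the sketch produces three crossing points, one per internal boundary, and the final sort plays the role of the post-processing step.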

An immediate consequence of Theorem 3 is the following.

Corollary 2 Let x and y be two strings over an alphabet I. Let m = min(|x|, |y|), n =
max(|x|, |y|). For edit operations of arbitrary nonnegative costs, an optimal edit script from
x to y can be computed by a CREW-PRAM in O(log n log^2 m) time, O(mn) space, and with
O(mn/log^2 m) processors.

6. A faster CREW-PRAM algorithm 

This section gives a CREW algorithm that is faster by a log m factor and uses O(mn/log m)
processors. More precisely, we establish the following.

Theorem 4 Let G be an m x n grid DAG, m ≤ n. A shortest source-to-sink path in
G can be computed by a CREW-PRAM in O(log n log m) time, O(mn) space, and with
O(mn/log m) processors.

Corollary 3 Let x and y be two strings over an alphabet I. Let m = min(|x|, |y|), n =
max(|x|, |y|). For edit operations of arbitrary nonnegative costs, an optimal edit script from
x to y can be computed by a CREW-PRAM in O(log n log m) time, O(mn) space, and with
O(mn/log m) processors.

From the developments of Sections 2-5, it should be clear that in order to establish the
above theorem, it suffices to show that:

1. The matrix DIST_{A∪B} can be obtained from DIST_A and DIST_B in O(log m) time,
O(m^2) space, and with O(m^2/log m) processors, and

2. The matrix D_G can be obtained from D_{G_1} and D_{G_2} (see Definition 3 and Figure 0)
in O(log m) time, O(m^2) space, and with O(m^2/log m) processors.

Since the proofs of 1 and 2 are very similar, we only give that for 1. Thus the rest of
this section deals with how to obtain DIST_{A∪B} from DIST_A and DIST_B in O(log m) time,
O(m^2) space, and with O(m^2/log m) processors.

6.1. Obtaining one row of DIST_{A∪B}

This subsection gives an O(log m) time, O(m log m) space, and O(m log m) processor
algorithm for obtaining one particular row of DIST_{A∪B}, i.e. computing θ(v,w) for a fixed
v ∈ L_A and all w ∈ L_B. The fixed vertex v is implicit in the rest of this subsection, so that
whenever we refer to a “path to w” it is understood that this path originates at v.

We refer to the vertices on the boundary common to A and B (denoted A∩B for short)
as crossing vertices and number them c_1, c_2, ..., c_{m/2}, where the numbering is from left to
right along the common boundary. We refer to the vertices in L_B as destination vertices
and denote them w_1, w_2, ..., w_m, numbered according to <_B, their order in L_B.

Definition 4 A crossing interval is a non-empty set of contiguous crossing vertices
{c_i, c_{i+1}, ..., c_j}.

We say that crossing interval I is to the left of crossing interval J, and J is to the right of
I, if the rightmost vertex of I is to the left of the leftmost vertex of J.

Definition 5 Let F ⊆ A∩B and w ∈ L_B, i.e. F is a set of crossing vertices (not necessarily
an interval) and w is a destination vertex. Let θ_F(w) denote the leftmost crossing vertex
in F incident to a (v,w) path that is shortest among all (v,w) paths constrained to pass
through F.

Note that θ_F(w) may differ from θ(v,w), but that θ_{A∩B}(w) = θ(v,w).

The following lemma is the analogue, for constrained paths, of property (2) of Section 2.

Lemma 3 Let F ⊆ A∩B and w_1, w_2 ∈ L_B. If w_1 <_B w_2 then θ_F(w_1) is not to the right
of θ_F(w_2).

Proof. Identical to that of property (2) of Section 2, and hence omitted. □

We now give an informal description of the algorithm. 

If U is any set of destination vertices and I is any crossing interval, then we will define
θ_I(U) to be a data structure that contains enough information to determine θ_I(w) for all
w ∈ U. The details of that data structure will be explained later.

It is useful to think of the computation as progressing through the nodes of a tree T 
which we now proceed to define. 

We define a crossing interval to be diadic if it is either A∩B (i.e. it consists of all crossing
vertices), or if it is the left or right half of a diadic crossing interval. Note that there
are exactly m - 1 diadic crossing intervals, which form a complete binary tree T rooted at
A∩B, and whose m/2 leaves are the m/2 crossing vertices (the i-th leaf of T containing c_i,
the i-th leftmost crossing vertex). Thus the diadic crossing interval at an interior node of
T is simply the union of the diadic crossing intervals of its two children in T. We can talk
about the height and the children of a diadic crossing interval (= its height and children in
T).

Since the m - 1 diadic crossing intervals are the only crossing intervals we shall be
interested in, from now on we simply say “interval” as a shorthand for “diadic crossing
interval”. Thus whenever we refer to an interval I we are implicitly assuming that I ∈ T,
i.e. that I is one of the m - 1 diadic crossing intervals. We use |I| to denote the size of
the interval, i.e. the number of crossing vertices in it. Observe that Σ_{I∈T} |I| = O(m log m).
Thus we have enough processors to associate |I| of them with each interval I (i.e. node I) of
T. Similarly, we can afford to use O(|I|) space per interval I. The computation proceeds in
2 log m - 1 stages, each of which takes constant time. The ultimate goal is for every interval
I to compute θ_I(L_B). The structure of the algorithm is reminiscent of the cascading
divide-and-conquer technique [6,4]: each I ∈ T will compute θ_I(U) for progressively larger subsets
U of L_B, subsets that double in size from one stage of the computation to the next. We
now proceed to state precisely what these subsets are.


Definition 6 A k-sample of L_B is obtained by choosing every k-th element of L_B (i.e.
every element whose rank in L_B is a multiple of k). For example, a 4-sample of L_B is
(w_4, w_8, ..., w_m). For k ∈ {0, 1, ..., log m}, let U_k denote an (m/2^k)-sample of L_B.


For example:

U_0 = {w_m},

U_1 = {w_{m/2}, w_m},

U_2 = {w_{m/4}, w_{m/2}, w_{3m/4}, w_m},

U_3 = {w_{m/8}, w_{m/4}, w_{3m/8}, w_{m/2}, w_{5m/8}, w_{3m/4}, w_{7m/8}, w_m},

U_{log m} = {w_1, w_2, ..., w_m} = L_B.

Note that |U_k| = 2^k = 2|U_{k-1}|.
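
As a concrete check of Definition 6, the following sketch builds the samples U_k for a small list of destinations. The helper names k_sample and sample_U, and the string labels "w1", ..., "w8", are hypothetical; m is assumed to be a power of two.

```python
def k_sample(lst, k):
    """Every k-th element of lst: those whose 1-based rank is a multiple of k."""
    return [lst[i] for i in range(k - 1, len(lst), k)]


def sample_U(k, dests):
    """U_k, the (m / 2^k)-sample of the destination list (m a power of two)."""
    m = len(dests)
    return k_sample(dests, m >> k)


dests = [f"w{j}" for j in range(1, 9)]          # m = 8
samples = [sample_U(k, dests) for k in range(4)]  # U_0 .. U_3
```

Each U_k contains U_{k-1} and is twice as large, which is the doubling that the stages of the algorithm rely on.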

At the t-th stage of the algorithm, an interval I of height h in T will use its |I| processors
to compute, in constant time, θ_I(U_{t-h}) if h ≤ t ≤ h + log m. It does so with the help of
information from θ_I(U_{t-1-h}), θ_{LeftChild(I)}(U_{t-h}), and θ_{RightChild(I)}(U_{t-h}), all of which are
available from the previous stage t - 1. If h > t or t > h + log m then interval I does nothing
during stage t. Thus before stage h the interval I lies “dormant”, then at stage t = h it first
“wakes up” and computes θ_I(U_0), then at the next stage t = h + 1 it computes θ_I(U_1), etc.
At stage t = h + log m it computes θ_I(U_{log m}), after which it is done. The details of what
information I stores and how it uses its |I| processors to perform stage t in constant time
are given below. First, we observe the following.

Lemma 4 The algorithm terminates after 2 log m - 1 stages.

Proof. After stage h + log m every interval I of height h is done, i.e. it has computed
θ_I(L_B). The root interval has height log m - 1 and thus is done after stage 2 log m - 1. □

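
The counting claims about the tree T and the stage schedule can be checked on a small instance with the sketch below. The function names and the choice m = 16 are illustrative assumptions; the code only tabulates intervals and stage numbers and does not compute any paths.

```python
from math import log2


def dyadic_intervals(num_crossing):
    """All (height, lo, hi) intervals of T over crossing vertices 1..num_crossing.

    num_crossing is assumed to be a power of two; leaves have height 0.
    """
    out = []
    size, height = 1, 0
    while size <= num_crossing:
        for lo in range(1, num_crossing + 1, size):
            out.append((height, lo, lo + size - 1))
        size *= 2
        height += 1
    return out


def active_stages(height, log_m):
    """Stages during which an interval of the given height does work."""
    return range(height, height + log_m + 1)


m = 16
log_m = int(log2(m))
intervals = dyadic_intervals(m // 2)            # m/2 crossing vertices
total_size = sum(hi - lo + 1 for _, lo, hi in intervals)
last_stage = max(active_stages(h, log_m)[-1] for h, _, _ in intervals)
```

For m = 16 there are m - 1 = 15 intervals, their sizes add up to (m/2) log m = 32, and the last active stage is 2 log m - 1 = 7, in line with Lemma 4.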
Thus to establish the main claim of this subsection, it suffices to prove the following 
lemma. 

Lemma 5 With |I| processors and O(|I|) space assigned to each interval I ∈ T, every stage
of the algorithm can be completed in constant time.

The rest of this subsection proves the above lemma. 

We begin by describing the way in which an interval I at height h in T stores θ_I(U_{t-h})
using only O(|I|) space. Rather than directly storing the values θ_I(w) for all w ∈ U_{t-h} (which
would require |U_{t-h}| space), we store instead the inverse mapping, which turns out to have a
compact O(|I|) space encoding because of the monotonicity property guaranteed by Lemma
3. In other words, for each c ∈ I, let

π_I(c, t) = {w ∈ U_{t-h} | θ_I(w) = c}.

Then Lemma 3 implies that the elements of π_I(c, t) are contiguous in the list U_{t-h}. More
specifically, the sets π_I(c, t), c ∈ I, form a partition of the set U_{t-h} into |I| subsets each of
which is either empty or contains contiguous elements of U_{t-h}. Therefore I does not need
to store the elements of π_I(c, t) explicitly; it suffices to remember where they begin
and end in U_{t-h}, i.e. O(1) space for each c ∈ I. Of course U_{t-h} is itself not stored explicitly
by I, since the height h and stage number t implicitly determine it. Thus O(|I|) space is
enough for storing π_I(c, t) for all c ∈ I.

Interval I stores the sets π_I(c, t), c ∈ I, in an array RANGE_I, with entries RANGE_I(c) =
(w_i, w_j) such that w_i (resp. w_j) is the first (resp. last) element of U_{t-h} that belongs to
π_I(c, t). If π_I(c, t) is empty then RANGE_I(c) equals ∅. At stage t of the algorithm, I must
update the RANGE_I array so that it changes from being a description of the π_I(c, t - 1)'s
to being a description of the π_I(c, t)'s. The rest of this subsection need only show how such
an update is done in constant time by the |I| processors assigned to I. Of course, since
we are ultimately interested in θ_{A∩B}(w) for every w ∈ L_B, at the end of the algorithm we
must run a postprocessing procedure which recovers this information from the RANGE_{A∩B}
array available at the root of T, i.e. it explicitly obtains θ_{A∩B}(w) for all w ∈ U_{log m}. But this
postprocessing is trivial to perform in O(log m) time with O(m) processors, and we shall
not concern ourselves with it any more.
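
The inverse-mapping representation can be illustrated as follows. This is a sequential sketch under stated assumptions: destinations are identified by their 1-based position in U_{t-h}, crossing vertices by integers, None plays the role of ∅, and the names build_range and lookup are hypothetical.

```python
def build_range(theta, crossing):
    """Build a RANGE-style array from monotone theta values.

    theta[pos-1] is the crossing vertex chosen for the destination at 1-based
    position pos; by the monotonicity of Lemma 3 it is nondecreasing, so each
    crossing vertex owns a contiguous (possibly empty) run of positions.
    """
    rng = {c: None for c in crossing}
    for pos, c in enumerate(theta, start=1):
        first, _ = rng[c] or (pos, pos)
        rng[c] = (first, pos)
    return rng


def lookup(rng, pos):
    """Recover theta for one destination, as the postprocessing step would."""
    for c, r in rng.items():
        if r is not None and r[0] <= pos <= r[1]:
            return c
    raise KeyError(pos)


theta = [2, 2, 3, 3, 3, 5]
rng = build_range(theta, crossing=[1, 2, 3, 4, 5])
```

Only two positions are kept per crossing vertex, regardless of how many destinations the current sample contains, which is the O(|I|) space bound claimed in the text.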

In the rest of this subsection, intervals L and R are the left and (respectively) right
children of I in T. Observe that, for any destination w, θ_I(w) is one of θ_L(w) or θ_R(w).
Furthermore, if θ_I(w) = θ_L(w) then θ_I(w') ∈ L for every w' smaller than w (in the <_B
ordering). Similarly, if θ_I(w) = θ_R(w) then θ_I(w') ∈ R for any w' larger than w. (These
observations follow from Lemma 3.)

The RANGE_I array alone is not enough to enable I to perform the updating required
at stage t. In addition, at each stage t, I must compute in a register called CRITICAL_I
an entry Critical_I(t) defined as follows.

Definition 7 At each stage t, let the critical destination for I, denoted Critical_I(t), be the
largest w ∈ U_{t-h} such that θ_I(w) = θ_L(w). If there is no such w (i.e. if θ_I(w) = θ_R(w) for
all w ∈ U_{t-h}), then Critical_I(t) = ∅.

Note that Lemma 3 ensures that Critical_I(t) is well defined. We shall later show how
storing and maintaining this critical destination enables I to update the RANGE_I array
in constant time. Of course it also places on I the burden of updating its CRITICAL_I
register so that after stage t it contains Critical_I(t) rather than Critical_I(t - 1). We shall
later show that updating the CRITICAL_I register can be done in constant time as well.

We now complete this subsection by explaining how I performs stage t, i.e. how it obtains
Critical_I(t) and the π_I(c, t)'s using the π_L(c, t - 1)'s, the π_R(c, t - 1)'s, and its previous
critical destination Critical_I(t - 1). The fact that the |I| processors can do this in constant time
is based on the following three observations:

Critical_I(t) is either the same as Critical_I(t - 1), or the successor of Critical_I(t - 1) in
U_{t-h}. (5)

If c ∈ L then π_I(c, t) = π_L(c, t - 1) - {the elements of π_L(c, t - 1) that are larger than
Critical_I(t) in the <_B ordering}. (6)

If c ∈ R then π_I(c, t) = π_R(c, t - 1) - {the elements of π_R(c, t - 1) that are less than or
equal to Critical_I(t) in the <_B ordering}. (7)

Correctness of (5)-(7) follows from the definitions. Their algorithmic implications are
discussed next.

Updating the CRITICAL_I register

Relationship (5) implies that in order to update CRITICAL_I (i.e. compute Critical_I(t))
all I has to do is determine which of Critical_I(t - 1) or its successor in U_{t-h} is the correct
value of Critical_I(t). This is done as follows. If Critical_I(t - 1) has no successor in U_{t-h} then
Critical_I(t - 1) = w_m and hence Critical_I(t) = Critical_I(t - 1). Otherwise the updating
is done in the following two steps. For shorthand, let r denote Critical_I(t - 1), and let s
denote the successor of r in U_{t-h}.

• The first step is to compute θ_L(s) and θ_R(s) in constant time. This involves a search
in L (resp. R) for the crossing vertex c in L (resp. R) whose π_L(c, t - 1) (resp. π_R(c, t - 1))
contains s. These two searches in L and R are done in constant time with the |I|
processors available. We explain how the search in L is done (that in R is similar and
omitted). I assigns a processor to each c ∈ L, and that processor tests whether s is
in π_L(c, t - 1); the answer is “yes” for exactly one of those |L| processors and thus
can be collected in constant time. Thus I can determine θ_L(s) and θ_R(s) in constant
time.

• The next step consists of determining which of the following two paths to s is better: the
one through θ_L(s), or the one through θ_R(s). If the path through θ_R(s) is the better of
the two then Critical_I(t) is the same as Critical_I(t - 1) and the CRITICAL_I register
stays the same (containing r). Otherwise Critical_I(t) is s, and we set CRITICAL_I
equal to s. This comparison of the two paths and the resulting update are done in constant
time (by one processor, in fact).

We next show how the just computed Critical_I(t) value is used to compute the π_I(c, t)'s
in constant time.

Updating the RANGE_I array

Relationship (6) implies the following for each c ∈ L:

1. If π_L(c, t - 1) is to the left of Critical_I(t) then π_I(c, t) = π_L(c, t - 1).

2. If π_L(c, t - 1) is to the right of Critical_I(t) then π_I(c, t) = ∅.

3. If π_L(c, t - 1) contains Critical_I(t) then π_I(c, t) consists of the portion of π_L(c, t - 1) up to
(and including) Critical_I(t).

Figure 8. Illustrating the second stage of the computation.

The above facts 1-3 immediately imply that O(1) time is enough for |L| of the |I| processors
assigned to I to compute π_I(c, t) for all c ∈ L, by adjusting the RANGE_I(c) value according
to rules 1-3 above (recall that the π_L(c, t - 1)'s are available in L from the previous stage
t - 1, and Critical_I(t) has already been computed and is in the CRITICAL_I register).

A similar argument shows that relationship (7) implies that |R| processors are enough
for computing π_I(c, t) for all c ∈ R. Thus I can update its RANGE_I array in constant time
with |I| processors. This completes the proof of Lemma 5.
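
One stage of an interval, following observations (5)-(7), can be simulated sequentially as in the sketch below. Several conventions are assumptions made for the illustration: destinations are identified by their index j in L_B (so U_{t-h} is the set of multiples of step = m/2^{t-h}), None stands for ∅, an empty previous critical destination is treated as having the first element of U_{t-h} as its successor, and left_wins(s) is a hypothetical predicate reporting whether the path to s through θ_L(s) is at least as good as the one through θ_R(s). The loops stand in for the |I| processors that act simultaneously in the text.

```python
def update_critical(prev_critical, step, m, left_wins):
    """Observation (5): the new critical destination is r or its successor s."""
    s = step if prev_critical is None else prev_critical + step
    if s > m:                      # r = w_m has no successor
        return prev_critical
    return s if left_wins(s) else prev_critical


def update_ranges(range_L, range_R, critical, step):
    """Observations (6) and (7): clip the children's ranges at the critical point."""
    new = {}
    for c, r in range_L.items():   # keep only destinations <= critical
        if r is None or critical is None or r[0] > critical:
            new[c] = None
        elif r[1] <= critical:
            new[c] = r
        else:
            new[c] = (r[0], critical)
    for c, r in range_R.items():   # keep only destinations > critical
        if r is None:
            new[c] = None
        elif critical is None or r[0] > critical:
            new[c] = r
        elif r[1] <= critical:
            new[c] = None
        else:
            new[c] = (critical + step, r[1])
    return new


# Toy instance: m = 8 destinations, step 1, left child wins exactly for w <= 5.
range_L = {"c1": (1, 3), "c2": (4, 8)}
range_R = {"c3": (1, 6), "c4": (7, 8)}
critical = update_critical(4, 1, 8, lambda w: w <= 5)
range_I = update_ranges(range_L, range_R, critical, 1)
```

In the toy instance the previous critical destination w_4 advances to w_5, the range of c2 is truncated to end there, and the range of c3 starts just after it.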

6.2. Obtaining all rows of DIST_{A∪B}

This subsection shows that O(m^2/log m) processors and O(m^2) space suffice for computing
in O(log m) time all the θ(v,w)'s (hence for computing the DIST_{A∪B} matrix). Let L_A and
L_B be as in previous sections. Our task is to compute θ(v,w) for all v ∈ L_A and all w ∈ L_B.
We use S(L, k) to denote the k-sample of a list L.

In the first stage of the computation, we assign m log m processors to each v ∈ S(L_A, log^2 m).
Then, in parallel for all v ∈ S(L_A, log^2 m), we use the method of the previous subsection
to obtain θ(v,w) for all w ∈ L_B. This first stage of the computation takes O(log m) time,
O(m^2) space, and O(m^2/log m) processors, and obtains θ(v,w) for all v ∈ S(L_A, log^2 m)
and w ∈ L_B.

In the second stage of the computation, we assign 2m processors to each w ∈ S(L_B, log m),
with the task of computing θ(v,w) for all v ∈ L_A. These 2m processors perform this
computation for their particular w in O(log m) time, as follows. The set of m/log^2 m values
{θ(v,w) | v ∈ S(L_A, log^2 m)} partitions the common boundary of A and B into m/log^2 m
pieces J_1, J_2, ... (see Figure 8). Let I_1, I_2, ... be the m/log^2 m pieces (of size log^2 m each)
into which S(L_A, log^2 m) partitions L_A (see Figure 8). Partition the group of 2m processors
assigned to w into m/log^2 m subgroups, where the i-th subgroup contains log^2 m + |J_i|
processors whose task is to compute, for all v ∈ I_i, which element of J_i equals θ(v,w). This
subgroup of log^2 m + |J_i| processors does this as follows.

1. It gives each of the log m elements of S(I_i, log m) (say, to element v) 1 + |J_i|/log m
processors that v uses to find out, in O(log m) time, which element of J_i equals
θ(v,w). The set of log m values {θ(v,w) | v ∈ S(I_i, log m)} partitions J_i into log m
pieces J_{i,1}, J_{i,2}, .... Let I_{i,1}, I_{i,2}, ... be the log m pieces (of size log m each) into which
S(I_i, log m) partitions I_i.

2. It partitions its log^2 m + |J_i| processors into log m subsubgroups, where the k-th
subsubgroup contains log m + |J_{i,k}| processors whose task is to compute, for all v ∈ I_{i,k},
which element of J_{i,k} equals θ(v,w). This subsubgroup of log m + |J_{i,k}| processors does
this in O(log m) time by giving to each of the log m elements of I_{i,k} (say, to element
v) 1 + |J_{i,k}|/log m processors that v uses to find out, in O(log m) time, which element
of J_{i,k} equals θ(v,w).

In the third stage of the computation, we assign 2m/√(log m) processors to each v ∈
S(L_A, √(log m)), with the task of computing θ(v,w) for all w ∈ L_B. These 2m/√(log m)
processors perform this computation for their particular v in O(log m) time, as follows. The set
of m/log m values {θ(v,w) | w ∈ S(L_B, log m)} partitions the common boundary of A and B
into m/log m pieces J_1, J_2, .... Let I_1, I_2, ... be the m/log m pieces (of size log m each) into
which S(L_B, log m) partitions L_B. Partition the group of 2m/√(log m) processors assigned
to v into m/log m subgroups, where the i-th subgroup contains √(log m) + |J_i|/√(log m)
processors whose task is to compute, for all w ∈ I_i, which element of J_i equals θ(v,w). This
subgroup of √(log m) + |J_i|/√(log m) processors does this as follows.

1. It gives each of the √(log m) elements of S(I_i, √(log m)) (say, to element w) 1 + |J_i|/log m
processors that w uses to find out, in O(log m) time, which element of J_i equals θ(v,w).
The set of √(log m) values {θ(v,w) | w ∈ S(I_i, √(log m))} partitions J_i into √(log m) pieces
J_{i,1}, J_{i,2}, .... Let I_{i,1}, I_{i,2}, ... be the √(log m) pieces (of size √(log m) each) into which
S(I_i, √(log m)) partitions I_i.

2. It partitions its √(log m) + |J_i|/√(log m) processors into √(log m) subsubgroups. The k-th
subsubgroup contains 1 + |J_{i,k}|/√(log m) processors whose task is to compute, for all
w ∈ I_{i,k}, which element of J_{i,k} equals θ(v,w). This subsubgroup of 1 + |J_{i,k}|/√(log m)
processors does this in O(log m) time as follows:

(a) If |J_{i,k}| > log m, by giving to each of the √(log m) elements of I_{i,k} (say, to
element w) |J_{i,k}|/log m processors that w uses to find out, in O(log m) time, which
element of J_{i,k} equals θ(v,w).

(b) If |J_{i,k}| ≤ log m, by partitioning I_{i,k} into 1 + |J_{i,k}|/√(log m) equal pieces I_{i,k,1}, I_{i,k,2}, ...
(each of size roughly log m/|J_{i,k}|) and giving each I_{i,k,l} one processor. This
processor sequentially finds θ(v,w) for all w ∈ I_{i,k,l} in O(log m) time, since
|I_{i,k,l}| |J_{i,k}| = O(log m).
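
The processor budgets of the second and third stages can be tallied as follows. This is a back-of-the-envelope check, assuming (as the partition of the common boundary suggests) that the pieces J_i have total size at most m once shared endpoints are counted.

```latex
\begin{align*}
\text{Stage 2: } & \frac{m}{\log m}\cdot 2m = \frac{2m^{2}}{\log m},
  \qquad
  \sum_{i}\bigl(\log^{2} m + |J_i|\bigr)
   \le \frac{m}{\log^{2} m}\,\log^{2} m + m = 2m, \\
\text{Stage 3: } & \frac{m}{\sqrt{\log m}}\cdot\frac{2m}{\sqrt{\log m}}
   = \frac{2m^{2}}{\log m},
  \qquad
  \sum_{i}\Bigl(\sqrt{\log m} + \frac{|J_i|}{\sqrt{\log m}}\Bigr)
   \le \frac{m}{\log m}\,\sqrt{\log m} + \frac{m}{\sqrt{\log m}}
   = \frac{2m}{\sqrt{\log m}}.
\end{align*}
```

In both stages the subgroup sizes add up to the group size, and the groups together use O(m^2/log m) processors.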

The fourth stage of the computation “fills in the blanks” by actually computing θ(v,w)
for all v ∈ L_A and w ∈ L_B. It does so with only m^2/log m processors by exploiting what was
computed in the previous stages. Partition L_A into m/√(log m) contiguous blocks X_1, X_2, ...
of size √(log m) each. Similarly, partition L_B into m/√(log m) contiguous blocks Y_1, Y_2, ... of
size √(log m) each. Let Z_{ij} be the interval on the boundary common to A and B that is
defined by the set of θ(v,w) such that v ∈ X_i and w ∈ Y_j. Of course we already know
the beginning and end of each such interval Z_{ij} (from the second and third stages of the
computation). Furthermore, we have the following:

Lemma 6 Σ_{i=1}^{m/√(log m)} Σ_{j=1}^{m/√(log m)} |Z_{ij}| = O(m^2/√(log m)).

Proof. First, observe that Z_{ij} and Z_{i+1,j+1} are adjacent intervals that are disjoint except for
one possible common endpoint (the rightmost point in Z_{ij} and the leftmost point in Z_{i+1,j+1}
may coincide). This observation implies that for any given integer δ (0 ≤ |δ| ≤ m/√(log m)),
we have (it is understood that |Z_{ij}| = 0 if j < 1 or j > m/√(log m)):

Σ_{i=1}^{m/√(log m)} |Z_{i,i+δ}| = O(m).

The lemma follows from the above simply by re-writing the summation in the lemma's
statement:

Σ_{δ=-m/√(log m)}^{m/√(log m)} Σ_{i=1}^{m/√(log m)} |Z_{i,i+δ}|. □

The above lemma implies that with a total of m^2/log m processors, we can afford to
assign a group of 1 + |Z_{ij}|/√(log m) processors to each pair X_i, Y_j. The task of this group is to
compute θ(v,w) for all v ∈ X_i and w ∈ Y_j (of course each such θ(v,w) is in Z_{ij}). It suffices to
show how such a group performs this computation in O(log m) time. If |Z_{ij}| ≤ √(log m) then
a single processor can solve the problem in O((√(log m))^2) = O(log m) time, by the quadratic
work method of Section 3. If |Z_{ij}| > √(log m) then we partition Z_{ij} into |Z_{ij}|/√(log m) pieces
J_1, J_2, ... of size √(log m) each. We assign to each J_k one processor which solves sequentially
the sub-problem defined by X_i, J_k, Y_j, i.e. it computes for each v ∈ X_i and w ∈ Y_j the
leftmost point of J_k through which passes a path that is shortest among the v-to-w paths
that are constrained to go through J_k. This sequential computation takes O(log m) time
(again, using the method of Section 3). It is done in parallel for all the J_k's. Now we must,
for each pair v, w with v ∈ X_i and w ∈ Y_j, select the best crossing point for it among the
|Z_{ij}|/√(log m) possibilities returned by the above-mentioned sequential computations.
This involves a total (i.e. for all such v, w pairs) of O(|X_i| |Y_j| |Z_{ij}|/√(log m)) = O(|Z_{ij}| √(log m))
comparisons, which can be done in O(log m) time by the |Z_{ij}|/√(log m) processors available
(Brent's Theorem).
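
The two counts used in this paragraph can be written out explicitly. The derivation below only combines the bound of Lemma 6 with the block sizes |X_i| = |Y_j| = √(log m) stated above; constants are not tracked.

```latex
\begin{align*}
\sum_{i,j}\Bigl(1 + \frac{|Z_{ij}|}{\sqrt{\log m}}\Bigr)
  &= \Bigl(\frac{m}{\sqrt{\log m}}\Bigr)^{2}
     + \frac{1}{\sqrt{\log m}}\sum_{i,j}|Z_{ij}|
   = \frac{m^{2}}{\log m}
     + \frac{1}{\sqrt{\log m}}\,O\!\Bigl(\frac{m^{2}}{\sqrt{\log m}}\Bigr)
   = O\!\Bigl(\frac{m^{2}}{\log m}\Bigr), \\
\frac{\text{comparisons}}{\text{processors}}
  &= \frac{O\bigl(|Z_{ij}|\sqrt{\log m}\bigr)}{|Z_{ij}|/\sqrt{\log m}}
   = O(\log m).
\end{align*}
```

The first line is the total number of processors over all pairs X_i, Y_j; the second is the per-group time obtained by sharing the comparisons evenly among the group's processors.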

7. CRCW-PRAM algorithm

This section briefly sketches how the partitioning schemes of Subsection 6.2 translate
into a CRCW-PRAM algorithm of time complexity O(log n (log log m)^2) and processor
complexity O(mn/log log m). Again, it suffices to show how DIST_{A∪B} can be obtained from
DIST_A and DIST_B in O((log log m)^2) time and with m^2/log log m processors.

The procedure is recursive, and we describe it for the more general case when DIST_A
is ℓ x h and DIST_B is h x ℓ (that is, |L_A| = |L_B| = ℓ and the common boundary has
size h). It suffices to show that for any integer q ≤ h of our choice, ℓh/q processors can,
in O((q + log log h) log log ℓ) time, compute θ(v,w) for all v ∈ L_A and w ∈ L_B. If we can
show this then we are done because we can choose q = log log h, and we have ℓ = m and
h = m/2.

The first stage of the computation partitions L_A into √ℓ contiguous blocks X_1, X_2, ...
of size √ℓ each. Similarly, L_B is partitioned into √ℓ contiguous blocks Y_1, Y_2, ... of size √ℓ
each. For each pair v, w such that v is an endpoint of an X_i and w is an endpoint of a Y_j, we
assign h/q processors (we have enough processors because there are O(ℓ) such pairs). These
processors compute, in O(q + log log h) time, the point θ(v,w). Thus, if we let Z_{ij} denote
the interval on the boundary common to A and B that is defined by the set of θ(v,w) such
that v ∈ X_i and w ∈ Y_j, then after this stage of the computation we know the beginning
and end of each such interval Z_{ij}.

The second stage of the computation “fills in the blanks” by doing, in parallel, ℓ recursive
calls, one for each X_i, Y_j pair. The call for pair X_i, Y_j returns θ(v,w) for all v ∈ X_i and
w ∈ Y_j (of course each such θ(v,w) is in Z_{ij}).

The time and processor complexities of the above method satisfy the recurrences:

T(ℓ, h) ≤ T(√ℓ, h) + c_1 (q + log log h),

P(ℓ, h) ≤ max{c_2 ℓh/q, Σ_{i,j} P(√ℓ, |Z_{ij}|)},

where c_1 and c_2 are constants. The time recurrence implies that T(ℓ, h) = O((q + log log h) log log ℓ).
That the processor recurrence implies P(ℓ, h) = O(ℓh/q) becomes apparent once one
observes that Σ_{i,j} |Z_{ij}| = O(h√ℓ). The proof of this last fact is similar to that of Lemma 6:
Σ_{i,j} |Z_{ij}| is re-written as Σ_{i,δ} |Z_{i,i+δ}| ≤ Σ_δ O(h) = O(h√ℓ). This completes the proof of the
claimed CRCW-PRAM bound.
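
The solution of the time recurrence can be seen by unrolling it, as sketched below. The stopping condition (list length at most 2) is an assumption made for the illustration, and the processor line is only a one-level substitution that does not track how constants behave across levels.

```latex
\begin{align*}
T(\ell,h) &\le T\bigl(\ell^{1/2},h\bigr) + c_1(q+\log\log h)
           \le T\bigl(\ell^{1/2^{j}},h\bigr) + j\,c_1(q+\log\log h), \\
\ell^{1/2^{j}} \le 2 &\iff j \ge \log\log \ell
  \quad\Longrightarrow\quad
  T(\ell,h) = O\bigl((q+\log\log h)\log\log \ell\bigr), \\
\sum_{i,j} \frac{\sqrt{\ell}\,|Z_{ij}|}{q}
  &= \frac{\sqrt{\ell}}{q}\sum_{i,j}|Z_{ij}|
   = \frac{\sqrt{\ell}}{q}\,O\bigl(h\sqrt{\ell}\bigr)
   = O\!\Bigl(\frac{\ell h}{q}\Bigr).
\end{align*}
```

With q = log log h, ℓ = m and h = m/2 the time bound becomes O((log log m)^2), matching the claim at the start of the section.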

Of course the same algorithm as above yields different complexity bounds when one
uses in it other CRCW-PRAM methods for computing the min of h objects. For example,
one can compute the min of h objects in O(k) time using h^{1+2^{-k}} processors on a
CRCW-PRAM, where k is any integer of one's choice. If such a method is used in conjunction with
the above algorithm, then the algorithm runs in O(k log n log log m) time with O(n m^{1+2^{-k}})
processors.
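
To give a feel for why concurrent writes help with minimum-finding, the sketch below simulates the classical all-pairs scheme, which finds the min of h objects in constant time with h^2 processors. It is offered only as background: it is not the h^{1+2^{-k}}-processor method mentioned above, which the text does not spell out, and the function name is hypothetical. The nested loops stand in for one parallel step in which processor (i, j) compares two values.

```python
def crcw_min_all_pairs(values):
    """Index of the leftmost minimum, simulating h^2 CRCW processors.

    Step 1: processor (i, j) marks i as a loser if values[j] beats values[i]
            (ties are broken in favor of the smaller index). All concurrent
            writes to loser[i] write the same value, True.
    Step 2: the single unmarked index holds the minimum.
    """
    h = len(values)
    loser = [False] * h
    for i in range(h):
        for j in range(h):
            if values[j] < values[i] or (values[j] == values[i] and j < i):
                loser[i] = True
    return next(i for i in range(h) if not loser[i])


winner = crcw_min_all_pairs([5, 3, 9, 3])
```

Reducing the processor count below h^2 is what the trade-off parameter k in the text is about: more rounds of comparison are exchanged for fewer processors.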

8. Conclusion

We gave a number of PRAM algorithms for the string editing problem. The algorithms were
fast and efficient, but the best time x processors bound was still a factor of log n away from
the O(|x||y|) time complexity of the best serial algorithm for the problem.

Acknowledgement. The authors are grateful to the referees for their careful reading and
useful comments.